Combining Unigrams and Bigrams in Semi-Supervised Text Classification

Authors

  • Igor Assis Braga
  • Maria Carolina Monard
  • Edson Takashi Matsubara
Abstract

Unlabeled documents vastly outnumber labeled documents in text classification. For this reason, semi-supervised learning is well suited to the task. Representing text as a combination of unigrams and bigrams has not shown consistent improvements over using unigrams alone in supervised text classification. Therefore, a natural question is whether this finding extends to semi-supervised learning, which provides a different way of combining multiple representations of data. In this paper, we investigate this question experimentally by running two semi-supervised algorithms, Co-Training and Self-Training, on several text datasets. Our results do not indicate improvements from combining unigrams and bigrams in semi-supervised text classification. In addition, they suggest that this outcome may stem from a strong “correlation” between unigrams and bigrams.
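
To make the two representations concrete, the following is a minimal sketch, not the authors' implementation, of how a document might be turned into a unigram view and a bigram view (as a two-view method such as Co-Training would need) and into a single combined view (as a one-view method such as Self-Training could use). The function names, the whitespace tokenizer, and the example sentence are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace. A real pipeline would
    # also handle punctuation, stop words, and stemming.
    return text.lower().split()


def ngram_counts(tokens, n):
    # Count every contiguous run of n tokens, joined by a single space.
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def two_views(text):
    # Hypothetical helper: returns separate unigram and bigram feature
    # counts, i.e. two representations of the same document.
    tokens = tokenize(text)
    return ngram_counts(tokens, 1), ngram_counts(tokens, 2)


def combined_view(text):
    # Hypothetical helper: merges both feature sets into one bag of features.
    unigrams, bigrams = two_views(text)
    return unigrams + bigrams


unigrams, bigrams = two_views("the cat sat on the mat")
combined = combined_view("the cat sat on the mat")
```

Note that every bigram is built from two unigrams that are already present in the unigram view, which gives some intuition for why the two representations could be strongly related.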

Similar resources

Text Classification by Augmenting the Bag-of-Words Representation with Redundancy-Compensated Bigrams

The most prevalent representation for text classification is the bag-of-words vector. A number of approaches have sought to replace or augment the bag-of-words representation with more complex features, such as bigrams or part-of-speech tags, but the results have been mixed at best. We hypothesize that a reason why integrating bigrams did not appear to help text classification is that the new fe...

Analysis of Polarity Information in Medical Text

Knowing the polarity of clinical outcomes is important in answering questions posed by clinicians in patient treatment. We treat analysis of this information as a classification problem. Natural language processing and machine learning techniques are applied to detect four possibilities in medical text: no outcome, positive outcome, negative outcome, and neutral outcome. A supervised learning m...

Using Bigrams in Text Categorization

In the past decade, considerable effort has been expended on attempting to come up with a document representation that is richer than the simple Bag-Of-Words (BOW). One of the widely explored approaches to enriching the BOW representation is using n-grams (usually bigrams) of words in addition to (or in place of) single words (unigrams). After more than ten years of unsuccessful attempts to imp...

Using Skipgrams, Bigrams, and Part of Speech Features for Sentiment Classification of Twitter Messages

In this paper, we consider the problem of sentiment classification of English Twitter messages using machine learning techniques. We systematically evaluate the use of different feature types on the performance of two text classification methods: Naive Bayes (NB) and Support Vector Machines (SVM). Our goal is threefold: (1) to investigate whether or not part-of-speech (POS) features are useful f...

SpamBayes: Effective open-source, Bayesian based, email classification system

This paper introduces the SpamBayes classification engine and outlines the most important features and techniques which contribute to its success. The importance of using the indeterminate ‘unsure’ classification produced by the chi-squared combining technique is explained. It outlines a Robinson/Woodhead/Peters technique of ‘tiling’ unigrams and bigrams to produce better results than relying s...

Publication date: 2009